Jess Goddard, Rich Pauloo, Ryan Shepherd, Noor Brody
Last updated 2022-07-01 14:57:02
In this report, we present a nationwide water service spatial boundary layer for community water systems and explain the tiered approach taken towards this end. The water service boundary layer produced in this report should not be interpreted as a final product, but rather, one that may be improved with the assimilation of additional water service boundaries from states and improvements to the methodology.
This report builds on a previous technical memorandum (eda_february.html) that should be read as prerequisite material to understand the broader context, background, and exploratory data analysis (EDA) that informs the approach taken herein. The resulting national water service boundary layer is the product of a “Tiered Explicit, Match, and Model” (henceforth, TEMM) approach. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity.
Below is a conceptual diagram of the three tiers in the TEMM approach. Tier 1 explicit boundaries always supersede Tier 2 matched proxy boundaries, which in turn always supersede Tier 3 modeled boundaries. Thus, the resulting the resulting water service boundary layer described in this report is combination of all three tiers, and depends on data availability for the water system, and whether or not it matches to a TIGER place.
Figure 1.1: Conceptual diagram of water service boundary tiers
In the sections that follow, we summarize our approach for each of the three tiers. As we move from Tier 1 to Tier 3, uncertainty increases, hence, we describe each Tier with increasing detail and provide measures of validation and uncertainty for Tier 2 and 3 estimates.
Finally, we encourage the reader to consider the TEMM water service boundary layer not as a final product, but rather, one that may be improved with the assimilation of additional Tier 1 explicit data, improvements to the Tier 2 matching algorithm, and refinement of the Tier 3 model. Ultimately, this entire workflow may be superseded by nationwide Tier 1 data, which would reduce the problem we address in the scope of work to one of simple Tier 1 data assimilation and cleaning.
Data sources for the TEMM layer are shown below.
| Data Source | Description | Link |
|---|---|---|
| EPA Safe Drinking Water Information Systems | Public water systems “master list” with key attribute data and some tabular geographic data like cities served | SDWIS MODEL |
| EPA Enforcement and Compliance History Online | Public water system facilities archive, drawing lat/long data for facilities centroids from FRS | ECHO Exporter |
| EPA Facilities Registry Service | FRS regularly updates facilities data with lat/long information, which pipes into ECHO | FRS Geospatial |
| US Census Bureau TIGER/Line (also called “TIGER Places”) | US Census data of places–cities and towns–used to identify potential service area boundaries | TIGER/Line Shapefiles accessed using R package tigris |
| Unregulated Contaminant Monitoring Rule | UCMR 3 and UCMR 4 provide data on pwsid and zip codes served, which can provide higher integrity centroids where needed | UCMR Occurence data |
| Homeland Infrastructure Foundation-level Data | Mobile home park centroids | MHP centroids |
| Labeled Water System Boundaries | URLs on state pages for various water service boundary sources | State URL tracking doc |
library(tidyverse)
library(sf)
library(fs)
library(rcartocolor)
library(geofacet)
library(mapview)
# don't allow fgb streaming: delivers self-contained html
mapviewOptions(fgb = FALSE)
# load environmental variable for staging path
staging_path <- Sys.getenv("WSB_STAGING_PATH")
output_path <- Sys.getenv("WSB_OUTPUT_PATH")
# read TEMM spatial output for key resul summary stats below
temm <- dir_ls(path(output_path, "temm_layer/"), glob = "*geojson") %>%
# get the most up to date temm spatial geojson layer if there are many
sort() %>% rev() %>% .[1] %>%
st_read(quiet = TRUE) %>%
# drop this column because it's read as a list and causes mapview trouble
select(-service_area_type_code)
# Tier 1 labeled data
wsb_labeled_clean <- path(staging_path, "wsb_labeled_clean.geojson") %>%
st_read(quiet = TRUE)
# read the clean matched output and perform minor transforms for plots.
# this has the same number of rows as `temm`, but contains Tier 1 "radius"
matched_input <- read_csv(path(staging_path, "model_input_clean.csv")) %>%
mutate(
# transform radius from m to km
radius = radius/1000)
# combine to get complete list
d <- matched_input %>%
left_join(temm %>%
select(pwsid, tier), on = "pwsid") %>%
select(-geometry)
# total water system count where pop available
nrow_d <- nrow(d)
# population served by all water systems
# Note there are systems with missing population, ignore
pop_total <- sum(d$population_served_count, na.rm = TRUE)
# calculate count and proportion of people served by each tier
pop <- d %>%
filter(!is.na(population_served_count)) %>%
group_by(tier) %>%
summarize(
count = format_bign(sum(population_served_count)),
prop = round((sum(population_served_count)/pop_total)*100, 2)
)
# read in sdwis original set of water systems for comparisons
acws <- read_csv(path(staging_path, "sdwis_water_system.csv")) %>%
filter(pws_activity_code == "A" &
pws_type_code == "CWS")
# number of active, community water systems in SDWIS
n_acws <- acws %>% nrow()
# population served by active, community water systems in SDWIS
pop_acws <- sum(acws$population_served_count)
# get full list of temm systems (incl those without population), "a" for all
a <- temm
# remove geometry
st_geometry(a) <- NULL
# calculate count and proportion of water systems by each tier
sys <- a %>%
group_by(tier) %>%
summarize(
count = n(),
prop = round((sum(count)/n_acws)*100, 2)
)The following outline reflects a summary of key findings, followed by a description of each Tier in the TEMM approach. Notably, the TEMM may be refined and re-run as new data sources are ingested or improvements are made to matching and modeling algorithms.
Key Results
Tier 1: Explicit boundaries
Tier 2: Matched TIGER proxy boundaries
Tier 3: Modeled boundaries
Models are based on centroids provided by ECHO and FRS. Where geometry accuracy is low (i.e. county centroid or state centroid), centroids of zip codes served from UCMR or mobile home park centroids are used, if available.
Random Forest, xgboost, and multiple linear regression models were fit to Tier 1 labeled boundaries to develop a model that estimates water system radii and spatial boundaries at all systems nationwide.
Linear regression models had the lowest error and best fit.
Circular water service boundary estimates were then calculated for all water systems; median, 5% and 95% confidence intervals reflect medium, low, and high water service boundary estimates.
The multiple linear regression fit has an impressive \(R^2 = 0.64\) and is mostly explained by service connection count, which intuitively makes sense: there is roughly linear scaling between service connection count and the spatial extent of a water service area. EDA, Random Forest, and xgboost models confirm the importance of service connection count as an explanatory variable.
Combine Tiers
Limitations
Recommendations
Develop an approach to refine Tier 2b polygons to disaggregate the boundary to the underlying water systems – accomplished in this latest iteration.
Centroid accuracy and the “pancake problem” may be improved by “re-centering” low-accuracy centroids with city/town spatial databases or acquiring new centroids from states – next steps.
Some water systems have a primacy agency code that conflicts with the spatial location (i.e., coordinates) of the water system. For example, water system with primacy agency code of state A, may have a centroid located in state B. This could be resolved by additional data cleaning.
Contributions
The key result of this study is a nationwide water service boundary layer. Here we show the proportion of population served by each TEMM Tier at nationwide and statewide scales.
In total, the TEMM data layer represents tap water delivery to 307.11 million people served by 44,786 water systems3. This amounts to 97.14% of the population reportedly served by active community water systems in SDWIS and 90.62% of active community water systems in SDWIS. Note that some systems report population of 0 or 1; these are wholesalers and therefore there is some inaccuracy around the total counts for population served and total population.
About 136 million people are covered by Tier 1 spatial data – impressive given that only 14 states that provided explicit boundary data. However, this relatively high coverage rate is unsurprising because these states (AZ, CA, CT, IL, KS, MO, NC, NJ, NM, OK, PA, TX, UT, WA) include notable centers of high-population like CA, TX, and PA.
Together, around 262 million people (85.33% of the population) are covered by either a Tier 1 or a Tier 2a spatial boundary. The remaining approximately 45 million people (14.67%) represented in the layer are covered by a less accurate, Tier 3 boundary. These results indicate relatively high confidence in the spatial accuracy of the resulting TEMM water boundary layer for a majority of the population served by community water systems.
pop %>%
kable(col.names = c("Tier", "Population count",
"Population proportion (%)"),
align = rep("l", 3),
caption = "Population coverage by tier (for systems with population reported)") %>%
kableExtra::kable_styling(full_width = FALSE)| Tier | Population count | Population proportion (%) |
|---|---|---|
| Tier 1 | 136,404,105 | 44.42 |
| Tier 2a | 125,652,248 | 40.91 |
| Tier 3 | 45,052,200 | 14.67 |
Together, around 28441 community water systems (57.54% of systems in layer) are covered by either a Tier 1 or a Tier 2a spatial boundary. Approximately 17519 water systems (35.45%) are covered by a Tier 3 boundary. There are 3464 systems (7.01%) of systems have no geometry associated with them and are therefore not represented in the final TEMM layer.
sys %>%
kable(col.names = c("Tier", "CWS\ncount",
"CWS\nproportion (%)"),
align = rep("l", 3),
caption = "Community water system (CWS) coverage by tier") %>%
kableExtra::kable_styling(full_width = FALSE)| Tier | CWS count | CWS proportion (%) |
|---|---|---|
| none | 3464 | 7.01 |
| Tier 1 | 16915 | 34.22 |
| Tier 2a | 11526 | 23.32 |
| Tier 3 | 17519 | 35.45 |
Next, we show the proportion of population covered in the final spatial layer per TEMM Tier on a state-by-state basis. Notably:
# state pop from full sdwis list
state_pop <- acws %>%
group_by(primacy_agency_code) %>%
summarize(state_pop = sum(population_served_count))
# dataframe for barplot geofacet: population prop served by each tier
# using all systems
ag <- a %>%
group_by(primacy_agency_code, tier) %>%
left_join(state_pop, on = "primacy_agency_code") %>%
# calculate count/proportion of population served per state and tier
summarise(
count = sum(population_served_count, na.rm = TRUE),
prop = count/state_pop
) %>%
ungroup() %>%
distinct() %>%
# add missing NA values per state code and tier group
complete(primacy_agency_code, tier,
fill = list(count = 0, prop = 0))
# sanity check the grouped summary above: this should return all 1
# group_by(ag, primacy_agency_code) %>%
# summarise(s = sum(prop, na.rm = TRUE)) %>%
# pull(s)
# geofacet of TEMM tier coverage per state in terms of population proportion
ag %>%
ggplot(aes(fct_rev(tier), prop, fill = tier)) +
geom_col() +
coord_flip() +
scale_fill_carto_d(direction = -1) +
scale_y_continuous(breaks = c(0, 1, 0.5),
labels = c(0, 1, 0.5)) +
geofacet::facet_geo(~primacy_agency_code) +
labs(x = "", y = "Proportion of Population Covered") +
guides(fill = "none") +
theme_minimal(base_size = 6) +
theme(panel.grid.minor.x = element_blank())Figure 3.1: TEMM Tier distribution by state
Native American water systems are designated in their water system id (pwsid) by their respective EPA region. For a full map of EPA regions see here. Below we show that most of these systems lack Tier 1 data and have generally higher proportion of Tier 3 data compared to state primacy codes.
ag %>%
# filter to Native American primacy codes
filter(! primacy_agency_code %in% c(state.abb, "DC")) %>%
ggplot(aes(fct_rev(tier), prop, fill = tier)) +
geom_col() +
coord_flip() +
scale_fill_carto_d(direction = -1) +
scale_y_continuous(breaks = c(0, 1, 0.5),
labels = c(0, 1, 0.5)) +
facet_wrap(~primacy_agency_code) +
labs(x = "", y = "Proportion of Population Covered") +
guides(fill = "none") +
theme_minimal(base_size = 6) +
theme(panel.grid.minor.x = element_blank())Figure 3.2: TEMM Tier distribution of systems in Native American territories-by EPA agency
# Data on number of native american territory systems overall vs. final set
nam_rep <- d %>% filter(! primacy_agency_code %in% c(state.abb, "DC"),
! primacy_agency_code %in% c("PR", "VI", "NN", "AS", "MP", "GU")) %>% nrow()
nam <- acws %>% filter(! primacy_agency_code %in% c(state.abb, "DC"),
! primacy_agency_code %in% c("PR", "VI", "NN", "AS", "MP", "GU")) %>% nrow()In total, of the 569 systems in Native American territories nationwide, the final database after data loss due to missing centroids represents 337 (59.23%) systems in Native American territories. A data table of the the above plots is provided below.
## <table width=100%> <caption>(#tab:datatable-temm-tier-distribution-per-state)TEMM tier distribution by state</caption> </table>
Next, for each state, we quantify the proportion of systems in each Tier by water size based on population categories according to EPA guidelines:
tribble(
~`EPA classification`, ~`Population Size`,
"Very Small", "500 or less",
"Small", "501 - 3,300",
"Medium", "3,301 - 10,000",
"Large", "10,001 - 100,000",
"Very Large", ">100,000"
) %>%
kable(align = rep("l", 2),
caption = "EPA system size categorization") %>%
kableExtra::kable_styling(full_width = FALSE)| EPA classification | Population Size |
|---|---|
| Very Small | 500 or less |
| Small | 501 - 3,300 |
| Medium | 3,301 - 10,000 |
| Large | 10,001 - 100,000 |
| Very Large | >100,000 |
In states that lack Tier 1 boundaries, “Very Small” and “Small” systems tend to have higher proportions of Tier 3 boundaries (grey bars) compared to larger systems. This result is unsurprising because Tier 2 spatial matches hinge on connecting water systems to Census “Places” which tend to be more populous areas. We therefore expect that smaller systems lack a Tier 2 match, and are hence covered by Tier 3 boundaries. When Tier 1 data is present, it seems to cover all system sizes (i.e., CA, NJ, TX), with a slight bias towards covering larger systems (e.g., OK, WA, CT).
# water system size levels in table above
size_lvls <- c("Missing Pop.", "Very Small", "Small", "Medium", "Large", "Very Large")
# dataframe for barplot geofacet: prop of each each tier per size class
ag2 <- a %>%
mutate(
size_class = case_when(
# in 99% of cases, pop served = 0 or 1 indicates wholesaler, but reporting as "missing pop"
population_served_count == 0 | population_served_count == 1 ~ "Missing Pop.",
population_served_count <= 500 ~ "Very Small",
population_served_count > 500 & population_served_count <= 3300 ~ "Small",
population_served_count > 3300 & population_served_count <= 10000 ~ "Medium",
population_served_count > 10000 & population_served_count <= 100000 ~ "Large",
population_served_count > 100000 ~ "Very Large"),
size_class = factor(size_class, levels = size_lvls)
) %>%
group_by(primacy_agency_code, size_class) %>%
# total count of tiers per state and size class
mutate(size_class_count = n()) %>%
ungroup() %>%
group_by(primacy_agency_code, size_class, tier) %>%
# calculate count/proportion of tiers per state and size class
summarise(
count = n(),
prop = count/size_class_count
) %>%
ungroup() %>%
distinct() %>%
# add missing NA values per state code and tier group
complete(primacy_agency_code, size_class, tier,
fill = list(count = 0, prop = 0))
# geofacet of TEMM tier coverage per state in terms of population proportion
ag2 %>%
ggplot(aes(size_class, prop, fill = tier)) +
geom_col(position = "stack") +
coord_flip() +
scale_fill_carto_d(direction = -1) +
scale_y_continuous(breaks = c(0, 1, 0.5),
labels = c(0, 1, 0.5)) +
geofacet::facet_geo(~primacy_agency_code) +
labs(x = "",
y = "Proportion of Systems Covered by Each Tier, Split by System Size",
subtitle = "Missing Pop indicates recorded population of 0 or 1, usually wholesaler",
fill = "Tier") +
theme_minimal(base_size = 6) +
theme(panel.grid.minor.x = element_blank(), legend.position = c(0.95, 0.15))Figure 3.3: TEMM Tier distribution by system size
Repeating the above analysis for primacy codes indicating Native American territories, we see that Native American systems are generally small or very small according to EPA size classification.
ag2 %>%
filter(! primacy_agency_code %in% c(state.abb, "DC")) %>%
ggplot(aes(size_class, prop, fill = tier)) +
geom_col(position = "stack") +
coord_flip() +
scale_fill_carto_d() +
scale_y_continuous(breaks = c(0, 1, 0.5),
labels = c(0, 1, 0.5)) +
facet_wrap(~primacy_agency_code) +
labs(x = "",
y = "Proportion of Systems Covered by Each Tier, Split by System Size (Population)",
subtitle = "Missing Pop indicates recorded population of 0 or 1, usually wholesaler",
fill = "Tier") +
theme_minimal(base_size = 6) +
theme(panel.grid.minor.x = element_blank())Figure 3.4: TEMM Tier distribution of systems in Native American territories-by system size
A data table of Tier categories across system sizes, nationwide, is provided below. Proportion represents the proportion of Tier within the size class.
## <table width=100%> <caption>(#tab:datatable-temm-tier-distribution-by-system-size)TEMM Tier distribution by system size</caption> </table>
A data table of the the above plots, looking at Tier counts by primacy agency code, is provided below.
## <table width=100%> <caption>(#tab:datatable-temm-tier-distribution-per-state-and-system-size)TEMM Tier distribution by system size and primacy agency code</caption> </table>
The national water service boundary layer is too large (i.e., around 400 MB) to plot in this interactive report, thus we show a static map below to illustrate the coverage provided by the Tiers 1-3 in the proportions described above, and direct the reader to the TEMM spatial layer in the project repository.
include_graphics(here("docs/img/temm-nation.png"))Figure 3.5: Nationwide TEMM water system boundary layer
Here, we zoom into California, Nevada, and Oregon – three states with high proportions of Tier 1, 2, and 3 spatial boundaries respectively. The Tier 3 circular buffers in this interactive map represent median (50th percentile) model estimates. Click on the layers icon to toggle between Tiers.
# plot CA, NV, OR
temm %>%
filter(primacy_agency_code %in% c("CA", "NV", "OR"),
tier != "none") %>%
select(pwsid, service_connections_count, tier) %>%
mapview::mapview(zcol = "tier", burst = TRUE)